PageRank without Hyperlinks: Structural Re-Ranking 
using Links Induced by Language Models 



Oren Kurland 1,3    Lillian Lee 1,2,3

kurland@cs.cornell.edu llee@cs.cornell.edu 

1. Computer Science Department, Cornell University, Ithaca NY 14853, U.S.A. 

2. Language Technologies Institute, Carnegie Mellon University, Pittsburgh PA 15213, U.S.A. 

3. Computer Science Department, Carnegie Mellon University, Pittsburgh PA 15213, U.S.A. 



ABSTRACT 

Inspired by the PageRank and HITS (hubs and authorities) 
algorithms for Web search, we propose a structural re-rank- 
ing approach to ad hoc information retrieval: we reorder the 
documents in an initially retrieved set by exploiting asym- 
metric relationships between them. Specifically, we consider 
generation links, which indicate that the language model in- 
duced from one document assigns high probability to the 
text of another; in doing so, we take care to prevent bias 
against long documents. We study a number of re-ranking 
criteria based on measures of centrality in the graphs formed 
by generation links, and show that integrating centrality into 
standard language-model-based retrieval is quite effective at 
improving precision at top ranks. 

Categories and Subject Descriptors: H.3.3 [Informa- 
tion Search and Retrieval]: Retrieval models 

General Terms: Algorithms, Experimentation 

Keywords: language modeling, PageRank, HITS, hubs, 
authorities, social networks, high-accuracy retrieval, graph- 
based retrieval, structural re-ranking 

1. INTRODUCTION 

Information retrieval systems capable of achieving high 
precision at the top ranks of the returned results would be 
of obvious benefit to human users, and could also aid pseudo- 
feedback approaches, question-answering systems, and other 
applications that use IR engines for pre-processing purposes 
I'll 1 1351 132| . But crafting such systems remains a key re- 
search challenge. 

The PageRank Web-search algorithm |T] uses explicitly- 
indicated inter-document relationships as an additional source 
of information beyond textual content, computing which 
documents are the most central. Here, we consider adapting 
this idea to corpora in which explicit links between docu- 
ments do not exist. 

How should we form links in a non-hypertext setting?
While previous work in summarization has applied Page- 
Rank to cosine-based links g], we draw on research demon- 
strating the success of using language models to improve IR 
performance in general |30l |5] and to model inter-document 
relationships in particular |16| . Specifically, we employ gen- 
eration links, which are based on the probability assigned by 
the language model induced from one document to the term 
sequence comprising another.¹ Our use of such links echoes
the standard language-model-based ranking principle, first
introduced in [30], that a document is relevant to the extent
that its corresponding language model assigns high proba- 
bility to the query. However, given that we are working with 
multiple documents rather than a single query, we employ 
a technique that compensates for length bias in estimating 
generation probabilities. 

We note that the analogy between hyperlinks and gener- 
ation links is not perfect. In particular, one can attribute 
much of the success of link-based Web-search algorithms to 
the fact that hyperlinks are (often) human-provided certifications
that two pages are truly related [13]. In contrast,
automatically-induced generation links are surely a noisier 
source of information. To compensate, we advocate an ap- 
proach (used elsewhere as well |39l 1101 ITTT1 1201 ETtI 122) ) that 
we term structural re-ranking: we use inter-document rela- 
tionships to compute an ordering not of the entire corpus, 
but of a (possibly unranked) set of documents produced 
by an initial retrieval method. This set should provide a 
reasonable ratio of relevant to non-relevant documents, and 
thus form a good foundation for our algorithms. Note that 
our approach differs in spirit from pseudo-feedback-based 
methods [31], which define a model based on the initially
retrieved documents expressly in order to re-rank the entire 
corpus. Indeed, since the quality of the initially retrieved 
results plays a major role in determining the effectiveness 
of pseudo-feedback-based algorithms [35], our methods can
potentially serve to greatly enhance the input to them. 

To compute centrality values for a given generation graph, 
we propose a number of methods, including variants of Page- 
Rank 1] and HITS (a.k.a. hubs and authorities) |13| . Com- 
parisons on various TREC datasets against numerous base- 
lines (including use of cosine-based links and re-ranking em- 

1 While the term "generate" is convenient, we do not think
of a "generator" document or language model as literally
"creating" others. Other work further discusses this issue
and proposes alternate terminology (e.g., "render") [17].



ploying only document-specific characteristics) show that 
language-model-based re-ranking using centrality as a form 
of "document prior" is indeed successful at moving relevant 
documents in the initial retrieval results higher up in the 
list. 



2. STRUCTURAL RE-RANKING 

Throughout this section, we assume that the following
have been fixed: the corpus $\mathcal{C}$ (in which each document has
been assigned a unique numerical ID); the query $q$; the set
$\mathcal{D}_{\mathrm{init}} \subseteq \mathcal{C}$ of top documents returned by some initial
retrieval algorithm in response to $q$ (this is the set upon which
re-ranking is performed); and the value of an ancestry
parameter $\alpha$ that pertains to our graph construction process.

For each document $d \in \mathcal{C}$, $p_d(\cdot)$ denotes the smoothed
unigram language model induced from $d$ (estimation details
appear in Section 2.4). We use $g$ and $o$ to distinguish between
a document treated as a "generator" and a document
treated as "offspring", that is, something that is generated
(details below).

We use the notation $(V, wt)$ for weighted directed graphs:
$V$ is the set of vertices and $wt : V \times V \to \{y \in \mathbb{R} : y \geq 0\}$
is the edge-weight function. Thus, there is a directed edge
between every ordered pair of vertices, but $wt$ may assign
zero weight to some edges. We write $wt(v_1 \to v_2)$ to denote
the value of $wt$ on edge $(v_1, v_2)$.

2.1 Generation Graphs 

Our use of language models to form links can be motivated
by considering the following two documents:

$d_1$: Toronto Sheffield Salvador
$d_2$: Salvador Salvador Salvador

Knowing that $d_2$ is important (i.e., central or relevant) would
provide strong evidence that $d_1$ is at least somewhat important.
However, knowing that $d_1$ is very important does not
allow us to conclude that $d_2$ is, since the importance of $d_1$
might stem from its first two terms. Using language models
induced from documents enables us to capture this asymmetry
in how centrality is propagated: we allow a document
$d$ to receive support for centrality status from a document
$o$ only to the extent that $p_d(o)$ is relatively large. (If $o$ is
not in fact important, the support it provides may not be
significant.) Note that ranking documents by $p_d(q)$, as first
proposed by Ponte and Croft [30], can be considered a
variation of this principle.

We are thus led to the following definitions. 

Definition 1. The top $\alpha$ generators of a document $d \in \mathcal{D}_{\mathrm{init}}$,
denoted $TopGen(d)$, is the set of $\alpha$ documents
$g \in \mathcal{D}_{\mathrm{init}} - \{d\}$ that yield the highest $p_g(d)$, where ties are broken
by document ID. (We suppress $\alpha$ in our notation for clarity.)

Definition 2. The offspring of a document $d \in \mathcal{D}_{\mathrm{init}}$ are
those documents that $d$ is a top generator of, i.e., the set
$\{o \in \mathcal{D}_{\mathrm{init}} : d \in TopGen(o)\}$.

Note that multiple documents can share offspring, and that 
it is possible for a document to have no offspring. 

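We can encode top-generation relationships using either
of two generation graphs $G_U = (\mathcal{D}_{\mathrm{init}}, wt_U)$ and
$G_W = (\mathcal{D}_{\mathrm{init}}, wt_W)$, where for $o, g \in \mathcal{D}_{\mathrm{init}}$,

\[ wt_U(o \to g) = \begin{cases} 1 & \text{if } g \in TopGen(o), \\ 0 & \text{otherwise;} \end{cases} \]

\[ wt_W(o \to g) = \begin{cases} p_g(o) & \text{if } g \in TopGen(o), \\ 0 & \text{otherwise.} \end{cases} \]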



Thus, in both graphs, positive-weight edges lead only from
offspring to their respective top $\alpha$ generators; but $G_U$ treats
(edges to) the top generators of $o$ uniformly, whereas $G_W$
differentially weights them by the probability their induced
language models assign to $o$.
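The graph construction in Definitions 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to be identified by the integers 0..n-1, `gen_prob[g][o]` is assumed to hold $p_g(o)$, the function names are hypothetical, and ties are assumed to be broken in favor of the smaller document ID (the text says only that ties are broken by ID).

```python
def top_generators(gen_prob, alpha):
    """Return {o: TopGen(o)} where gen_prob[g][o] holds p_g(o).

    Assumption: ties are broken in favor of the smaller document ID.
    """
    n = len(gen_prob)
    top = {}
    for o in range(n):
        # A document is never its own generator.
        candidates = [g for g in range(n) if g != o]
        # Highest generation probability first, then smaller ID.
        candidates.sort(key=lambda g: (-gen_prob[g][o], g))
        top[o] = set(candidates[:alpha])
    return top


def generation_graphs(gen_prob, alpha):
    """Build the edge-weight matrices of G_U and G_W.

    wt[o][g] is the weight of the edge from offspring o to generator g.
    """
    n = len(gen_prob)
    top = top_generators(gen_prob, alpha)
    wt_u = [[0.0] * n for _ in range(n)]
    wt_w = [[0.0] * n for _ in range(n)]
    for o in range(n):
        for g in top[o]:
            wt_u[o][g] = 1.0             # uniform weight
            wt_w[o][g] = gen_prob[g][o]  # weight by p_g(o)
    return wt_u, wt_w
```

Storing weights as `wt[o][g]` keeps the row index aligned with the source of each edge, which is convenient when rows are later normalized into transition probabilities.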

Some of our algorithms require "smoothed" versions of 
these graphs, in which all edges (including self-loops) have 
non-zero weight, to work correctly. To be specific, we employ 
PageRank's £Q smoothing technique. 

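Definition 3. Given an edge-weighted directed graph
$G = (\mathcal{D}_{\mathrm{init}}, wt)$ and smoothing parameter $\lambda \in [0, 1)$, the smoothed
graph $G^{[\lambda]} = (\mathcal{D}_{\mathrm{init}}, wt^{[\lambda]})$ has edge weights defined as
follows: for every $o, g \in \mathcal{D}_{\mathrm{init}}$,

\[ wt^{[\lambda]}(o \to g) = (1 - \lambda) \cdot \frac{1}{|\mathcal{D}_{\mathrm{init}}|} + \lambda \cdot \frac{wt(o \to g)}{\sum_{g' \in \mathcal{D}_{\mathrm{init}}} wt(o \to g')}. \]

The weights of all edges leading out of any given node in $G^{[\lambda]}$
sum to 1 and thus may be treated as transition probabilities.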
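As an illustration of Definition 3, the following sketch smooths a weight matrix row by row. The function name `smooth_graph` and the matrix layout (`wt[o][g]` is the weight of the edge from o to g) are assumptions made for this example; the sketch also assumes every node has at least one positive-weight outgoing edge, which holds for generation graphs whenever the ancestry parameter is at least 1 and the generation probabilities are positive.

```python
def smooth_graph(wt, lam):
    """Return the smoothed edge weights of Definition 3.

    wt[o][g] is the weight of edge o -> g; lam is the smoothing
    parameter, assumed to lie in [0, 1).
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    n = len(wt)
    smoothed = []
    for o in range(n):
        out_sum = sum(wt[o])
        if out_sum <= 0.0:
            raise ValueError("node %d has no outgoing weight" % o)
        # Mix a uniform jump with the normalized original weights.
        smoothed.append(
            [(1.0 - lam) / n + lam * wt[o][g] / out_sum for g in range(n)]
        )
    return smoothed
```

Each row of the result sums to 1 and has no zero entries, so it can be read directly as a row of a transition matrix.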

With these concepts in hand, we can now phrase our 
centrality-determination task as follows: given a generation 
graph, compute for each node (i.e., document) how much 
centrality is "transferred" to it from other nodes — by our 
edge-weight definitions, centrality therefore corresponds to
the degree to which a document is responsible for "generat- 
ing" (perhaps indirectly) the other documents in the initially 
retrieved set. We now consider different ways to formalize 
this notion of transference of centrality.

2.2 Computing Graph Centrality 

A straightforward way to define the centrality of a document
$d$ with respect to a given graph $G = (\mathcal{D}_{\mathrm{init}}, wt)$ is to
set it to $d$'s weighted in-degree, which we call its influx:

\[ Cen_I(d; G) \stackrel{\mathrm{def}}{=} \sum_{o \in \mathcal{D}_{\mathrm{init}}} wt(o \to d). \qquad (1) \]



The Uniform Influx algorithm sets G = Gu, so that the 
only thing that matters is how many offspring d has; it is 
thus reminiscent of the journal impact factor function from 
bibliometrics 0, which computes normalized counts of ex- 
plicit citation links. The Weighted Influx algorithm sets 
G = Gw, so that the generation probabilities that d assigns 
to its offspring are factored in as well. 

As previously noted by Pinski and Narin in their work
on influence weights [29], one intuition not accounted for by
weighted in-degree methods is that a document with even
a great many offspring should not be considered central (or
relevant) if those offspring are themselves very non-central.
We can easily modify Equation 1 to model this intuition; we
simply scale the evidence from a particular offspring document
by that offspring's centrality, thus arriving at the
following recursive equation:

\[ Cen_{RI}(d; G) \stackrel{\mathrm{def}}{=} \sum_{o \in \mathcal{D}_{\mathrm{init}}} wt(o \to d) \cdot Cen_{RI}(o; G), \qquad (2) \]

where we also require that $\sum_{d \in \mathcal{D}_{\mathrm{init}}} Cen_{RI}(d; G) = 1$.
Unfortunately, for arbitrary $G_U$ and $G_W$, Equation 2 may not
have a unique solution or even any solution at all under the
normalization constraint just given; however, a unique solution
is guaranteed to exist for their PageRank-smoothed
versions.² By analogy with the two influx algorithms given
above, then, we have the Recursive Uniform Influx algorithm,
which sets $G = G_U^{[\lambda]}$ and is a direct analog of
PageRank, and the Recursive Weighted Influx algorithm,
which sets $G = G_W^{[\lambda]}$.
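Both centrality definitions can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the convergence tolerance, and the iteration cap are assumptions, and `recursive_influx` expects a matrix whose rows already sum to 1 with all entries positive (that is, a PageRank-smoothed graph), so that power iteration converges to the unique solution of Equation 2.

```python
def influx(wt):
    """Equation 1: weighted in-degree of every node.

    wt[o][d] is the weight of the edge from o to d.
    """
    n = len(wt)
    return [sum(wt[o][d] for o in range(n)) for d in range(n)]


def recursive_influx(wt_smoothed, tol=1e-12, max_iter=1000):
    """Equation 2 solved by power iteration on a smoothed graph.

    Assumes each row of wt_smoothed sums to 1 and is strictly positive.
    tol and max_iter are illustrative defaults, not values from the paper.
    """
    n = len(wt_smoothed)
    cen = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new = [
            sum(wt_smoothed[o][d] * cen[o] for o in range(n))
            for d in range(n)
        ]
        total = sum(new)
        new = [c / total for c in new]  # enforce the sum-to-one constraint
        if max(abs(a - b) for a, b in zip(new, cen)) < tol:
            return new
        cen = new
    return cen
```

Passing the matrices of $G_U$ or $G_W$ to `influx` gives the two non-recursive scores, while passing their smoothed versions to `recursive_influx` gives the two recursive scores.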

2.3 Incorporating Initial Scores 

The centrality scores presented above can be used in isolation
as criteria by which to rank the documents in $\mathcal{D}_{\mathrm{init}}$.
However, if available, it might be useful to incorporate more
information from the initial retrieval engine to help handle
cases where centrality and relevance are not strongly correlated.
(Recall that it participates in any case by specifying
the set $\mathcal{D}_{\mathrm{init}}$.) In our experiments, we explore one concrete
instantiation of this approach: we apply language-model- 
based retrieval 1301 |2") to determine Dmit, and consider the 
following family of re-ranking criteria: 

\[ Cen(d; G) \cdot p_d(q), \qquad (3) \]

where $d \in \mathcal{D}_{\mathrm{init}}$ and $Cen$ is one of the centrality functions
defined in the previous section. This gives rise to the
algorithms Uniform Influx+LM, Weighted Influx+LM,
Recursive Uniform Influx+LM, and Recursive Weighted
Influx+LM.

Incidentally, our choosing $p_d(q)$ as initial score function
has the interesting consequence that it suggests interpreting
$Cen(d; G)$ as a document "prior"; in fact, Lafferty and
Zhai write, "with hypertext, [a document prior] might be
the distribution calculated using the 'PageRank' scheme"
[18]. We will return to this idea later.
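The re-ranking criterion of Equation 3 amounts to sorting by a product of two per-document scores. The sketch below assumes the centrality values and the query-generation probabilities are already available as dictionaries keyed by document ID; the function name `rerank` and the tie-breaking by document ID are assumptions made for illustration.

```python
def rerank(doc_ids, centrality, query_prob):
    """Order documents by Cen(d; G) * p_d(q), highest score first.

    centrality and query_prob map each document ID to its score.
    Assumption: ties are broken by ascending document ID.
    """
    scores = {d: centrality[d] * query_prob[d] for d in doc_ids}
    return sorted(doc_ids, key=lambda d: (-scores[d], d))
```

Setting every centrality value to the same constant reproduces the ranking induced by $p_d(q)$ alone, which matches the reading of centrality as a document prior.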

2.4 Estimating Generation Probabilities: Length 
and Entropy Effects 

Generation probabilities form the basis for the graphs on 
which our algorithms are defined. This section describes our 
method for estimating these probabilities. 

Let $tf(w \in x)$ denote the number of times the term $w$
occurs in the text or text collection $x$. What is often called
the maximum-likelihood estimate (MLE) of $w$ with respect
to $x$ is defined as

\[ p_x^{MLE}(w) \stackrel{\mathrm{def}}{=} \frac{tf(w \in x)}{\sum_{w'} tf(w' \in x)}. \]

Some prior work in language-model-based retrieval [221 1 
employs a Dirichlet-smoothed version:

\[ p_x^{[\mu]}(w) \stackrel{\mathrm{def}}{=} \frac{tf(w \in x) + \mu \cdot p_{\mathcal{C}}^{MLE}(w)}{\sum_{w'} tf(w' \in x) + \mu}; \]

the smoothing parameter $\mu$ controls the degree of reliance on
relative frequencies in the corpus rather than on the counts
in $x$. Both estimates just described are typically extended

2 The edge weights correspond to the transition probabili- 
ties for a Markov chain that is aperiodic and irreducible, 
and hence has a unique stationary distr ibution |S] that can 
be computed by a variety of means |34llo1l7|. In our exper- 
iments, power iteration converged very quickly. 



to distributions over term sequences by assuming that terms
are independent: for an $n$-term text sequence $w_1 w_2 \cdots w_n$,

\[ p_x^{MLE}(w_1 w_2 \cdots w_n) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{n} p_x^{MLE}(w_j); \]

\[ p_x^{[\mu]}(w_1 w_2 \cdots w_n) \stackrel{\mathrm{def}}{=} \prod_{j=1}^{n} p_x^{[\mu]}(w_j). \]

Another estimation approach, which we adopt, incorporates 
the Kullback-Leibler divergence D between document lan- 
guage models £53 El ( see al so previously proposed ranking 
principles 26 18): unless otherwise specified, for document 
d and word sequence s (in our setting, either a document or 
the query), we set Pd(s) to 

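\[ p_d^{KL,\mu}(s) \stackrel{\mathrm{def}}{=} \exp\left(-D\left(p_s^{MLE}(\cdot) \,\|\, p_d^{[\mu]}(\cdot)\right)\right). \qquad (4) \]

Equation 4 has some useful properties. We can show that

\[ p_d^{KL,\mu}(s) = \underbrace{\left(p_d^{[\mu]}(s)\right)^{1/|s|}}_{\text{term A}} \cdot \underbrace{\exp\left(H\left(p_s^{MLE}(\cdot)\right)\right)}_{\text{term B}}, \]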

where $H$ is the entropy function. Now, observe that for
both $p_x^{MLE}(\cdot)$ and $p_x^{[\mu]}(\cdot)$, longer text sequences tend to be
assigned lower probabilities; this would correspond to an
unmotivated reduction of weights for edges out of long documents
in the graph $G_W$. However, term A length-normalizes
$p_d(s)$ via the geometric mean, which has helped ameliorate
numerical problems in previous work [19]. Additionally,
term B raises the generation probability for texts with
high-entropy MLE term distributions. High entropy may be
correlated with a larger number of unique terms (for example,
we get an entropy of 0 for the document "Salvador Salvador
Salvador" but log 3 for "Toronto Sheffield Salvador"),
which, in turn, has previously been suggested as a cue for
relevance 1 3 3 111 II . Hence, generators of documents inducing 
high-entropy language models may be good candidates for 
centrality status. (We hasten to point out, though, that for 
the algorithms based on smoothed graphs (Definition 3), the
entropy term cancels out due to our normalization of edge 
weights.) 
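The estimate in Equation 4 and its factorization into a length-normalized term and an entropy term can be checked numerically on the two toy documents from Section 2.1. The sketch below is an illustration under assumptions: the helper names are hypothetical, the "corpus" is just the two toy documents, $\mu = 1$ is an arbitrary choice, and every term is assumed to occur in the corpus (otherwise the lookup in `corpus_mle` fails).

```python
import math
from collections import Counter


def mle(tokens):
    """Maximum-likelihood unigram distribution of a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


def dirichlet_model(tokens, corpus_mle, mu):
    """Dirichlet-smoothed unigram model; returns a function w -> prob."""
    counts = Counter(tokens)
    total = len(tokens)

    def prob(w):
        # Counter returns 0 for terms that do not occur in tokens.
        return (counts[w] + mu * corpus_mle[w]) / (total + mu)

    return prob


def kl_generation_prob(s_tokens, d_tokens, corpus_mle, mu):
    """Equation 4: exp(-D(MLE of s || smoothed model of d))."""
    p_s = mle(s_tokens)
    p_d = dirichlet_model(d_tokens, corpus_mle, mu)
    kl = sum(p * math.log(p / p_d(w)) for w, p in p_s.items())
    return math.exp(-kl)


def entropy(dist):
    """Entropy (natural log) of a distribution given as a dict."""
    return -sum(p * math.log(p) for p in dist.values())


def decomposition(s_tokens, d_tokens, corpus_mle, mu):
    """Return (term A, term B) of the factorization of Equation 4."""
    p_d = dirichlet_model(d_tokens, corpus_mle, mu)
    log_seq_prob = sum(math.log(p_d(w)) for w in s_tokens)
    term_a = math.exp(log_seq_prob / len(s_tokens))  # geometric mean
    term_b = math.exp(entropy(mle(s_tokens)))
    return term_a, term_b


d1 = "toronto sheffield salvador".split()
d2 = "salvador salvador salvador".split()
corpus_mle = mle(d1 + d2)  # toy corpus: assumption for this example
```

With these inputs, `entropy(mle(d2))` is zero and `entropy(mle(d1))` equals log 3, and the product of the two values returned by `decomposition` agrees with `kl_generation_prob` up to floating-point error.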

3. RELATED WORK 

Work on structural re-ranking in traditional ad hoc in- 
formation retrieval has mainly focused on query-dependent 
clustering, wherein one seeks to compute and exploit a clus- 
tering of the initial retrieval results HU1 12771 1371 
Clusters represent structure within a document set, but do 
not directly induce an obvious single criterion or principle 
by which to rank documents; for instance, they have been 
used to improve rankings indirectly by serving as smoothing
mechanisms [22]. Interestingly, some centrality measures
have been previously employed to produce clusterings [36].

There has been increasing use of techniques based on 
graphs induced by implicit relationships between documents 
or other linguistic items H El IH H HI HH EH The work 
in the domain of text summarization HI 12 11 most resembles 
ours, in that it also computes centrality on graphs (although 
the nodes correspond to sentences or terms instead of doc- 
uments). Perhaps the main contrast with our work is that 
links were not induced by generation probabilities; Section
4.2 presents the results of experiments studying the relative
merits of our particular choice of link definition.



Our centrality scores constitute a relationship-based re- 
ranking criterion that can serve as a bias affecting the initial 
retrieval engine's scores, as in Equation 3. Alternative biases
that are based on individual documents alone have also been 
investigated. Functions incorporating document or average 
word length |1 1 1 1141 1251 are applicable in our setting; we 
report on experiments with (variants of) document length 
in Section 4.2. Other previously suggested biases that may
be somewhat less appropriate for general domains include
document source [25] and creation time [21], and webpage
hyperlink in-degree and URL form [15].

4. EVALUATION 

4.1 Experimental Setting 

The objective of structural re-ranking is to (re-)order an
initially-retrieved document set $\mathcal{D}_{\mathrm{init}}$ so as to improve
precision at the very top ranks of the final results. Therefore, we
employed the following three evaluation metrics: the precision
of the top 5 documents (prec@5), the precision of the
top 10 documents (prec@10), and the mean reciprocal rank
of the first relevant document (MRR) [32].
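The three metrics named above can be computed as in the following sketch. The function names and the input format (a ranked list of document IDs plus a set of relevant IDs per query) are assumptions for illustration; a query with no relevant document in its ranked list is assumed to contribute a reciprocal rank of 0.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none appears."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

prec@5 and prec@10 correspond to `precision_at_k` with k set to 5 and 10, averaged over queries in the same way as the reciprocal rank.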

We are interested in the general validity of the various 
structural re-ranking methods we have proposed. We be- 
lieve that a good way to emphasize the effectiveness (or lack 
thereof) of the underlying principles is to downplay the role 
of parameter tuning. Therefore, we made the following de- 
sign decisions, with the effect that the performance numbers 
we report are purposely not necessarily the best achievable 
by exhaustive parameter search: 

• The initial ranking that created the set $\mathcal{D}_{\mathrm{init}}$ was built
according to the function $p_d^{KL,\mu}(q)$, where the value of
$\mu$ was chosen to optimize the non-interpolated average
precision of the top 1000 retrieved documents. This
is not one of our evaluation metrics, but is a reasonable
general-purpose optimization criterion. (In fact,
results with this initial ranking turned out to be statistically
indistinguishable from the results obtained by
optimizing with respect to the actual evaluation metrics,
although of course they were lower in absolute
terms.)

• We only optimized settings for $\alpha$ (the ancestry parameter
controlling the number of top generators considered
for each document) and $\lambda$ (the edge-weight smoothing
factor) with respect to precision among the top 5 documents,
not with respect to all three evaluation metrics
employed.

The search ranges for the latter two parameters were:

$\alpha$: 4, 9, 19, ..., $|\mathcal{D}_{\mathrm{init}}| - 1$

$\lambda$: 0, 0.05, 0.1, 0.2, ..., 0.9, 0.95

As it turned out, for many instances (except for the Weighted
Influx algorithm), the optimal value of $\alpha$ with respect to
precision at 5 was either 4 or 9, suggesting that a relatively
small number of generators per document should be considered
when constructing the graph. In contrast, $\lambda$ exhibited
substantial variance in optimal value for precision at 5 in
some of our datasets. We set $|\mathcal{D}_{\mathrm{init}}|$, the number of
initially-retrieved documents, to 50 in all results reported below
(similar performance patterns were obtained when $|\mathcal{D}_{\mathrm{init}}| = 100$).

The remaining details are as follows. We conducted our 
experiments on the following four TREC corpora: 



corpus   # of docs   queries          disk(s)
AP89     84,678      1-46, 48-50      1
AP       242,918     51-64, 66-150    1-3
WSJ      173,252     151-200          1-2
TREC8    528,155     401-450          4-5


(AP89 is a subset of AP containing articles just from the 
year 1989). All documents and queries (in our case, TREC- 
topic titles) were stemmed using the Porter stemmer and to- 
kenized, but no other pre-processing steps were applied. We 
used the Lemur toolkit [27] for language-model estimation.
Statistically-significant differences in performance were de- 
termined using the two-sided Wilcoxon test at a confidence 
level of 95%. 

4.2 Results 

In the tables that follow, we use the following abbrevia- 
tions for algorithm names. 



U-In          Uniform Influx
W-In          Weighted Influx
R-U-In        Recursive Uniform Influx
R-W-In        Recursive Weighted Influx
U-In+LM       Uniform Influx+LM
W-In+LM       Weighted Influx+LM
R-U-In+LM     Recursive Uniform Influx+LM
R-W-In+LM     Recursive Weighted Influx+LM



4.2.1 Primary evaluations 

Our main experimental results are presented in Table 1.
The first three rows specify reference-comparison data. The
initial ranking was, as described above, produced using $p_d^{KL,\mu}(q)$
with $\mu$ chosen to optimize for non-interpolated precision at
1000. The empirical upper bound on structural re-ranking,
which applies to any algorithm that re-ranks $\mathcal{D}_{\mathrm{init}}$, indicates
the performance that would be achieved if all the relevant
documents within the initial fifty were placed at the top
of the retrieval list; note that these bounds indicate that
the initial rankings for AP89 are considerably worse than those for
the other three corpora. We also computed an optimized
baseline for each metric $m$ and test corpus $\mathcal{C}$; this consists
of ranking all the documents (not just those in $\mathcal{D}_{\mathrm{init}}$) by
$p_d^{KL,\mu}(q)$ with $\mu$ chosen to yield the best $m$-results on $\mathcal{C}$. As
a sanity check, we observe that the performance of the initial
retrieval method is always below that of the corresponding
optimized baseline (though not statistically distinguishable
from it).

The first question we are interested in is how our structural
re-ranking algorithms taken as a whole do. As shown
in Table 1, our methods improve upon the initial ranking
in many cases, specifically, roughly 2/3 of the 96 relevant
comparisons (8 centrality-based algorithms × 4 corpora × 3
evaluation metrics). An even more gratifying observation is
that Table 1 shows (via italics and boldface) that in many
cases, our algorithms, even though optimized for precision
at 5, can outperform a language model optimized for a different
(albeit related) metric $m$ even when performance is
measured with respect to $m$; see, for example, the results
for precision at 10 on the AP corpus.

Closer examination of the results in Table 1 reveals that
in about 60% of the 48 relevant comparisons, our algorithms
not only are at least as effective when applied to the graph
$G_W$ as when applied to $G_U$, but often yield better performance
results; the comparison between Recursive Weighted





                AP89                     AP                       WSJ                      TREC8
                prec@5  prec@10  MRR     prec@5  prec@10  MRR     prec@5  prec@10  MRR     prec@5  prec@10  MRR
upper bound     63.7    53.1     75.5    87.6    78.8     93.0    89.6    80.0     100.0   94.4    85.0     98.0
init. ranking   28.3    26.5     52.3    45.7    43.2     59.6    54.8    48.4     76.2    50.0    45.6     69.1
opt. baselines  30.0    27.4     54.3    46.5    43.9     63.5    56.0    49.4     77.2    51.2    46.4     69.6
U-In            29.6    27.8     39.5    50.9    49.0     66.3    50.0    46.6     66.7    50.0    45.0     62.0
W-In            31.3    29.6     46.8    51.3    48.7     64.4    52.0    47.8     63.3    49.2    43.4     63.7
U-In+LM         ?       ?        ?       ?       ?        ?       ?       ?        ?       ?       ?        ?
W-In+LM         31.7    27.6     48.4    51.1    48.4     63.0    57.2    50.0     77.2    51.6    49.6     64.5
R-U-In          31.3    28.9     46.4    51.5    48.9     63.4    53.6    49.6     68.5    52.0    44.6     66.5
R-W-In          32.2    29.6     40.5    52.1    49.1     63.9    54.0    49.2     70.2    52.4    44.6     66.5
R-U-In+LM       33.0    29.3     45.8    52.1    49.2     64.3    58.8    51.0     78.6    55.6    46.0     68.4
R-W-In+LM       33.5    29.8     46.0    52.9    49.0     62.6    58.8    50.6     78.6    56.0    45.8     67.6


Table 1: Primary experimental results, showing algorithm performance with respect to our 12 evaluation 
settings (3 performance metrics x 4 corpora). For each evaluation setting, improvements over the optimized 
baselines are given in italics; statistically significant differences between our structural re-ranking algorithms 
and the initial ranking and optimized baselines are indicated by i and o respectively; bold highlights the best 
results over all ten algorithms. 

Notice that even though the structural re-ranking algorithms were optimized for prec@5 only (and produce 
the best results for this metric), they still perform well with respect to the other two metrics. 



Influx (R-W-In) and Recursive Uniform Influx (R-U-In) is a 
good example. These results imply that it is a bit better to 
explicitly incorporate generation probabilities into the edge 
weights of our generation graphs than to treat all the top 
generators of a document equally. 

Another observation we can draw from Table 1 is that
adding in query-generation probabilities as weights on the
centrality scores (see Equation 3) tends to enhance performance.
This can be seen by comparing rows labeled with
some algorithm abbreviation "X" against the corresponding
rows labeled "X+LM": about 80% of the 48 relevant
comparisons exhibit this improvement. Most of the
counterexamples occur in settings involving precision at 10 and
MRR, which we did not optimize our algorithms for.

Similarly, by comparing "Y"-labeled rows with "R-Y"-labeled
ones, we see that in about 70% of the 48 relevant
comparisons, it is better to use the recursive formulation of
Equation 2, where the centrality of a document is affected
by the centrality of its offspring, than to ignore offspring
centrality as is done by Equation 1.

Perhaps not surprisingly, then, the Recursive Uniform In- 
flux+LM and Recursive Weighted Influx+LM algorithms, 
which combine the two preferred features just described (re- 
cursive centrality computation and use of the initial search 
engine's score function) appear to be our best performing 
algorithms: working from a starting point below the op- 
timized baselines, they improve the initial retrieval set to 
yield results that even at their worst, are not only clearly 
better than the initial ranking for precision at 5 and 10, but 
are also merely statistically indistinguishable from the opti- 
mized baselines. Moreover, in one setting (AP, precision at 
10) they actually produce statistically significant improve- 
ments over the optimized baseline even though they were 
not optimized for that evaluation metric. 

It is interesting to note that the relative performance of 
our algorithms does not seem to depend strongly on the 
quality of the initial ranking, in the following sense. The 
average percentage of relevant documents among the 50 that 
are initially retrieved is 21%, 35.5%, 33.3% and 30.3% for 



AP89, AP, WSJ and TREC8, respectively, but the relative 
improvements for precision at 5 and 10 that our algorithms 
achieve with respect to the initial ranking are almost always 
higher on AP89 than on WSJ or TREC8. 

4.2.2 Links based on the vector-space model 

We have advocated the use of generation relationships 
to define centrality, where these asymmetric relationships 
are based on language-model probabilities. However, other 
inter-document relationships have been previously exploited 
in information retrieval. Perhaps the most well-known is 
vector-space proximity, with the cosine frequently used as 
(symmetric) closeness metric; indeed, as mentioned above, 
previous work in summarization |1] has used the cosine to 
determine centrality in ways very similar to the ones we have 
considered. It is thus important to examine whether the 
performance improvements we have achieved can be repro- 
duced, or even surpassed, by the use of vector-space-based 
links rather than language-model-based generation links. 

To run this evaluation, we simply modified Definition 1
and all eight of our structural re-ranking algorithms to use
the cosine of the angle between log tf.idf document vectors,
rather than language-model probabilities, to form the basis
for determining the edge weights of our graphs. (Note
that the fact that the cosine is symmetric does not imply
that edges $(v_1, v_2)$ and $(v_2, v_1)$ get the same weight even
in our non-smoothed graphs: document $d_1$ being a top
"generator" of $d_2$ with respect to the cosine does not imply
the reverse.) It should be observed that the language-model
weights on centrality scores (i.e., the $p_d(q)$ term in Equation
3, on which the "+LM" algorithms are based) were not replaced
with cosine values, which makes sense since we want
our comparison to focus on the effect of different means of
computing graph-based centrality.

Table 2 depicts the relative performance differences between
using our language-model-based graphs and graphs
induced using vector-space proximity in the manner just 
described. For each choice of algorithm, evaluation mea- 
sure, and dataset, we indicate which formulation, if any, 





U-In W-In U-In+LM W-In+LM R-U-In R-W-In R-U-In+LM R-W-In+LM 


prec @5 
AP89 prec @10 
MRR 


□ □ □ 

♦ ♦ 

□ □ □ □ □ □ 


prec @5 
AP prec @10 
MRR 


♦ ♦♦ ♦♦♦ ♦ ♦ 

♦ ♦♦ ♦♦♦ ♦ ♦ 

♦ ♦ ♦ ♦ ♦ 


prec @5 
WSJ prec @10 
MRR 


♦ ♦ 

□ 


prec @5 
TREC8 prec @10 
MRR 


♦ ♦ ♦ ♦ ♦ ♦ 

♦ ♦ ♦ ♦ 

□ □ 



Table 2: Structural re-ranking based on language models (LM) vs. structural re-ranking based on cosine- 
measured vector-space proximity (VEC). We indicate the settings in which the relative difference was at least 
5% with either a "♦" (LM superior) or a "□" (VEC superior). 





                  AP89                     AP                       WSJ                      TREC8
                  prec@5  prec@10  MRR     prec@5  prec@10  MRR     prec@5  prec@10  MRR     prec@5  prec@10  MRR
uniform (= init)  28.3    26.5     52.3    45.7    43.2     59.6    54.8    48.4     76.2    50.0    45.6     69.1
W-In              31.7    27.6     48.4    51.1*   48.4*    63.0    57.2    50.0     77.2    51.6    49.6*    64.5
R-W-In            33.5    29.8     46.0    52.9*   49.0*    62.6    58.8*   50.6     78.6    56.0    45.8     67.6
length            29.1    24.3     50.8    41.6    41.4     55.3    44.4*   42.4*    64.6*   47.2    41.4     64.2
log(length)       30.4    27.0     52.5    45.3    43.2     60.6    57.2    49.0     69.8*   49.6    46.8     69.2
entropy           30.0    26.5     52.6    46.1    42.5     60.8    56.8    48.6     71.1*   49.6    46.8     ?
uniqTerms         27.4    24.8     52.3    42.0    41.3     56.2    50.0    44.6     68.8    49.2    44.2     71.2
log(uniqTerms)    30.4    27.0     52.5    45.9    42.3     60.8    57.2    49.0     70.0*   49.6    47.2     70.0




Table 3: Comparison between our use of language-model-based structural-centrality scores in Equation 3 vs.
non-structural re-ranking heuristics. For each evaluation setting, italics mark improvements over the default 
baseline of uniform centrality scores, stars (*) indicate statistically significant differences with this default 
baseline, and bold highlights the best results over all eight algorithms. 



resulted in at least 5% relative improvement with respect to 
the other. As can be seen, in at least three of our four cor- 
pora, our language-modeling approach seems to be a more 
effective basis for determining document centrality than the 
vector-space/cosine. We hasten to point out, though, that 
in most instances, vector-space proximity yielded better per- 
formance than the corresponding baselines (the results are 
omitted since the precise numerical comparison does not 
yield additional information); this finding provides further 
support to the idea that the overall structural re-ranking 
approach is a flexible and effective paradigm that can incor- 
porate different types of inter-document relationships when 
appropriate. 

4.2.3 Inducing centrality with the HITS algorithm 

One well-known alternative method for computing centrality 
in a graph is the HITS algorithm [13], originally proposed 
for Web search. There has been some work utilizing 
it for text summarization in non-Web domains as well [23]. 
The reason we have not yet discussed it in detail is that 
it differs conceptually from our proposed algorithms in an 
important way: two different notions of centrality are iden- 
tified, represented by hub and authority scores. While the 
concepts of hubs and authorities are highly suitable for Web- 
search scenarios, it is less clear whether it is useful in our 
setting to distinguish between the two. 



As a preliminary investigation, we experimented with us- 
ing hub and authority scores as measures of centrality on 
the generation graphs we built. Space constraints preclude 
a detailed discussion, but the results may be summarized 
as follows. We found that authority scores yielded better 
performance than hub scores, and that the results were gen- 
erally at least as good as or better than those for the opti- 
mized baselines. However, they were slightly inferior in sev- 
eral cases to those of the corresponding influx algorithms. 
Thus, it seems that our method for graph construction can 
support a variety of different algorithms, but that the HITS- 
style hubs/authorities distinction may not be effective for 
the task we have addressed. 
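
For readers unfamiliar with the hub/authority computation, the following is a minimal sketch of the standard HITS power iteration on a weighted directed graph. It is not the implementation used in our experiments: the function name `hits`, the iteration count, and the three-node toy graph are illustrative assumptions, and the edge weights here are arbitrary rather than derived from language models.

```python
import numpy as np


def hits(weights, iterations=50):
    """Standard HITS power iteration (illustrative sketch).

    weights[i, j] > 0 is the weight of the directed edge i -> j.
    Returns (hub, authority) score vectors, each L2-normalized.
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iterations):
        # Authority of j: weighted sum of hub scores of nodes pointing to j.
        auth = w.T @ hub
        auth /= np.linalg.norm(auth) or 1.0
        # Hub of i: weighted sum of authority scores of nodes i points to.
        hub = w @ auth
        hub /= np.linalg.norm(hub) or 1.0
    return hub, auth


# Hypothetical toy graph: nodes 0 and 1 both point to node 2.
toy = np.array([[0, 0, 1],
                [0, 0, 1],
                [0, 0, 0]])
hub, auth = hits(toy)
print("hub scores:      ", hub)
print("authority scores:", auth)
```

In this toy graph node 2 receives all the authority mass, while nodes 0 and 1 share the hub mass; either vector could be substituted for the centrality score when re-ranking.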

4.2.4 Non-structural re-ranking 

So far, we have discussed the use of graph-based centrality 
as a re-ranking criterion, the idea being that relationships 
between documents can serve as an additional source of 
information. Our best empirical results seem to be produced 
by using the weighted formulation given in Equation 3 from 
Section 2.3: 

Cen(d; G) · p_d(q). 

Since, as noted above, in this equation Cen(d; G) can be 
regarded as a "prior" on documents, it is natural to ask 
whether other previously proposed biases on generation 
probabilities might prove similarly useful. The comparison is 
especially interesting because these biases have tended to be 
isolated-document heuristics; we thus refer to their use as a 
replacement for Cen(d; G) as "non-structural re-ranking". 

Document length has been employed several times in the 
past to model the intuition that longer texts contain more 
information [14]. We refine this hypothesis to disentangle 
several distinct notions of information: the number of 
tokens in a document, the distribution of these tokens, and 
the number of types ("Salvador Salvador Salvador" contains 
three tokens but only one type). Thus, as substitutions for 
centrality in the above expression, we consider not only 
document length, but also the entropy of the term distribution 
and the number of unique terms (used as the basis for 
pivoted unique normalization in [33]). As a baseline, we took 
the initial retrieval results; note that doing so corresponds to 
using a uniform bias, or, equivalently, using no bias at all. 
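
The substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our experimental code: the helper names `bias_scores` and `rerank`, the use of the natural logarithm, the unsmoothed maximum-likelihood term distribution used for the entropy, and the toy documents and query-likelihood values are all hypothetical.

```python
import math
from collections import Counter


def bias_scores(tokens):
    """Isolated-document biases for one non-empty tokenized document."""
    counts = Counter(tokens)
    length = len(tokens)   # number of tokens
    uniq = len(counts)     # number of types (unique terms)
    # Entropy of the unsmoothed term distribution (assumption: natural log).
    entropy = sum((c / length) * math.log(length / c) for c in counts.values())
    return {
        "uniform": 1.0,
        "length": float(length),
        "log(length)": math.log(length),
        "entropy": entropy,
        "uniqTerms": float(uniq),
        "log(uniqTerms)": math.log(uniq),
    }


def rerank(docs, query_likelihood, bias_name):
    """Order documents by bias(d) * p_d(q), highest score first."""
    scored = {
        doc_id: bias_scores(tokens)[bias_name] * query_likelihood[doc_id]
        for doc_id, tokens in docs.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical two-document pool with made-up query likelihoods p_d(q).
docs = {
    "d1": "salvador salvador salvador".split(),
    "d2": "el salvador peace talks resume".split(),
}
p = {"d1": 0.02, "d2": 0.01}

for name in ("uniform", "length", "entropy", "uniqTerms"):
    print(name, rerank(docs, p, name))
```

Note that the log-based and entropy biases evaluate to zero for a document with a single type, which drives its score to zero in this simple product form; a practical implementation would need to decide how to handle such degenerate cases.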

As can be seen in Table 3, taking the log of token or 
type count is an improvement over using the raw frequencies, 
often yielding above-baseline performance. The entropy is 
more effective than raw frequency of either tokens or types, 
and in two cases leads to the best performance overall. 
However, in the majority of settings, structural re-ranking gives 
the highest accuracies. 

4.2.5 Re-ranking vs. ranking 

We posed our centrality-computation techniques as meth- 
ods for improving the results returned by an initial retrieval 
engine, and showed that they are successful at accomplish- 
ing this goal. But one can ask whether it is necessary to 
restrict our attention to an initial pool D_init; that is, would 
we expect similarly good results if we based our generation 
graphs on the entire corpus? As it happens, preliminary ex- 
periments with the Recursive Uniform Influx+LM and Re- 
cursive Weighted Influx+LM algorithms on two full corpora 
(AP89 and LA combined with FR) showed that one would 
be better off sticking with the standard language-modeling 
approach if no pre-filtering of documents is available. 

We do not see this finding as surprising, for our intuition is 
that in the re-ranking case, there is a more direct connection 
between centrality and relevance since we can assume that 
relevant documents comprise a reasonable fraction of the 
initial retrieval results. 

5. CONCLUSION 

We have proposed and evaluated a number of methods for 
structural re-ranking using inter-document generation rela- 
tionships based on language models. Our main experiments 
showed that even non-optimized instantiations of our over- 
all approach yield results rivaling those of optimized base- 
lines. Further analysis revealed that generation relation- 
ships seem more effective within our centrality-computation 
framework than relationships based on vector-space proxim- 
ity do, and that using inter-document relationships seems 
to be a promising alternative to employing the isolated- 
document heuristics we implemented (several of which were 
novel to this study). Based on our results, we believe that 
exploring other methods for combining statistical language 
models and explicitly graph-based techniques is a fruitful 
line for future research. 